Data Visualisation
Introduction to R
Create your first plot in R
Test your hypotheses using informative data visualizations
Source: R4DS
A great way to get to know your data
Critical to communicate your findings
R code:
Functions:
Using the console, find the summation of 45, 978, and 121.
Or:
What is 67 divided by 6?
What is the square root of 894?
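A minimal sketch of how these three questions could be typed into the console. The variable names (`total`, `quotient`, `root`) are illustrative only and are not part of the original exercise.

```r
# Add the three numbers with the + operator
45 + 978 + 121

# Or: use the built-in sum() function for the same result
total <- sum(45, 978, 121)
total

# Division uses the / operator
quotient <- 67 / 6
quotient

# sqrt() returns the square root of its argument
root <- sqrt(894)
root
```

Assigning a result with `<-` stores it for later use; typing the name on its own prints the value.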
Packages are collections of R functions, data, and compiled code in a well-defined format.
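As a rough sketch, installing and loading a package could look like the following. It uses ggplot2 (one of the tidyverse packages) as the example. The CRAN mirror URL is an assumption; any mirror works.

```r
# Install once per machine; skipped here if the package is already present.
# The repos URL is an assumed CRAN mirror, swap in your preferred one.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}

# Load the package in every new R session before using its functions
library(ggplot2)
```

Installation happens once; `library()` is needed each time you restart R.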
tidyverse packages.
For your sanity’s sake, for your co-author’s sanity’s sake
Keeps everything:
Organised
Reproducible
Sustainable
Use getwd() to see where you are on your computer.
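A small sketch of checking the working directory. The name `wd` is illustrative, and the printed path will differ on every machine.

```r
# getwd() returns the current working directory as a character string
wd <- getwd()
wd

# list.files() shows what R can see in that directory
list.files(wd)
```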
From R4DS - Data Visualization:
Do cars with big engines use more fuel than cars with small engines?
This session will borrow (read: steal) heavily from Hadley Wickham’s R for Data Science book.
The. Best. Resource.
Hadley Wickham is one of the lead authors of the tidyverse. He created ggplot as part of his PhD dissertation.
library(ggplot2)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_blank(),
    plot.title.position = "plot",
    plot.title = element_text(face = "bold")
  ) +
  labs(
    title = "Relationship between engine displacement and highway miles per gallon by class",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon",
    color = "Class"
  )

| manufacturer | model | displ | year | cyl |
|---|---|---|---|---|
| audi | a4 | 1.8 | 1999 | 4 |
| audi | a4 | 1.8 | 1999 | 4 |
| audi | a4 | 2.0 | 2008 | 4 |
| audi | a4 | 2.0 | 2008 | 4 |
| audi | a4 | 2.8 | 1999 | 6 |
| audi | a4 | 2.8 | 1999 | 6 |
Learn more about this data set by typing ?mpg into your console.
mpg data set
A couple of useful variables:
displ: engine displacement, in litres
hwy: highway miles per gallon
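With these two variables, the question of whether cars with big engines use more fuel can be explored with a basic scatterplot. This is a minimal sketch; the object name `p_basic` is illustrative.

```r
library(ggplot2)

# Engine size on the x-axis, highway fuel efficiency on the y-axis
p_basic <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Printing the object draws the plot
p_basic
```

`ggplot()` sets up the data, and `geom_point()` adds a layer of points whose positions come from the `aes()` mapping.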
What do you see when you run the following?
How many rows are in mpg? How many columns?
What does the drv variable describe?
Make a scatterplot of hwy vs cyl.
What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
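One possible way to work through these questions is sketched below. The code for the first question is not reproduced in the text; the sketch assumes it is `ggplot(data = mpg)` on its own, as in the corresponding R4DS exercise. All object names are illustrative.

```r
library(ggplot2)

# Assumed first step: ggplot() with no geom layer draws an empty canvas
p_empty <- ggplot(data = mpg)

# Number of rows and columns in the data set
n_rows <- nrow(mpg)
n_cols <- ncol(mpg)

# The drv variable is described in the help page: type ?mpg in the console

# hwy vs cyl: hwy on the y-axis, cyl on the x-axis
p_hwy_cyl <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = hwy))

# class vs drv: both variables are categorical, so many cars are
# drawn on top of each other at a small number of positions
p_class_drv <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = drv, y = class))

# Count the distinct class/drv combinations that appear in the data
n_positions <- nrow(unique(mpg[, c("class", "drv")]))
```

Comparing `n_positions` with `n_rows` shows how much overplotting hides in the class vs drv plot.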
We are not restricted to looking at only two interesting elements of our data.
You can use visual elements or aesthetics (aes) to communicate many dimensions in your data.
Let’s look at a categorical variable: the class of car (SUV, 2 seater, pick up truck, etc.).
Look for meaningfully defined groups.
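As a sketch of adding a third dimension, the class of car can be mapped to the color aesthetic. The object name `p_class` is illustrative.

```r
library(ggplot2)

# Mapping class to color inside aes() gives each class its own color
# and adds a legend automatically
p_class <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

p_class
```

Other aesthetics such as `size` or `alpha` can be swapped in for `color` in the same position, although they tend to suit ordered or continuous variables better than unordered categories.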
You can use visual elements to communicate your findings in engaging ways.
What’s gone wrong with this code? Why are the points not blue?
Which variables in mpg are categorical? Which variables are continuous?
Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
What happens if you map the same variable to multiple aesthetics?
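The code behind the "not blue" question is not included in the text. A common cause, and the one assumed in this sketch, is placing `color = "blue"` inside `aes()`, where it is treated as a variable to map rather than a color to set.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as data (one constant category),
# so ggplot2 picks a default color and adds a legend
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Outside aes(): color is set manually, so every point is drawn in blue
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

p_mapped
p_set
```

Rule of thumb: put an aesthetic inside `aes()` when it should vary with the data, and outside when it is a fixed property of the layer.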
Less is more when it comes to data visualization.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_blank(),
    plot.title.position = "plot",
    plot.title = element_text(face = "bold")
  ) +
  labs(
    title = "Relationship between engine displacement and highway miles per gallon by class",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon",
    color = "Class"
  )

Create a scatterplot of hwy vs displ and a categorical variable in the mpg data set.
Customize your plot using the theme() function.
What happens if you facet on a continuous variable?
What do the empty cells in the previous plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
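A sketch of the two plots this question compares. The faceted plot uses `facet_grid(drv ~ cyl)` as named in the question; "this plot" is assumed to be the scatterplot of drv against cyl from the corresponding R4DS exercise. Object names are illustrative.

```r
library(ggplot2)

# One panel per combination of drv (rows) and cyl (columns)
p_facets <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

# Assumed comparison plot: which drv/cyl combinations occur at all
p_combos <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = drv, y = cyl))

# A count table of the same combinations; zeros mark pairs with no cars
combo_counts <- table(mpg$drv, mpg$cyl)
combo_counts

p_facets
p_combos
```

Panels that come out empty in the faceted plot line up with the zero counts in `combo_counts`.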
What does . do?
This session you:
Set up your data science tools
Plotted complex data in an engaging way
Discovered interesting relationships in the data
Connected these relationships or trends to your expectations (or hypotheses about the data)